Output prediction




CodeCrash: Exposing LLM Fragility to Misleading Natural Language in Code Reasoning

Lam, Man Ho, Wang, Chaozheng, Huang, Jen-tse, Lyu, Michael R.

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have recently demonstrated strong capabilities in code-related tasks, but their robustness in code reasoning under perturbations remains underexplored. We introduce CodeCrash, a stress-testing framework with 1,279 questions from CruxEval and LiveCodeBench, designed to evaluate reasoning reliability under structural perturbations and misleading natural language (NL) contexts. Through a systematic evaluation of 17 LLMs, we find that models often shortcut reasoning by over-relying on NL cues, leading to an average performance degradation of 23.2% in output prediction tasks. Even with Chain-of-Thought reasoning, models still suffer an average drop of 13.8% due to distractibility and rationalization, revealing a lack of the critical reasoning needed to distinguish actual code behavior from misleading cues. While Large Reasoning Models with internal reasoning mechanisms improve robustness by fostering critical thinking, plausible yet incorrect hints can trigger pathological self-reflection, causing 2-3 times higher token consumption and, in extreme cases for QwQ-32B, catastrophic cognitive dissonance. We refer to this phenomenon as Reasoning Collapse. CodeCrash provides a rigorous benchmark for evaluating robustness in code reasoning, guiding future research and development toward more reliable and resilient models.
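
The CodeCrash perturbations themselves are not reproduced here, but a hypothetical illustration (not taken from the benchmark) of the kind of misleading NL cue the abstract describes, attached to an output-prediction question, might look like this:

```python
# Hypothetical output-prediction item with a misleading natural-language cue.

def f(nums):
    # Misleading comment: "returns the largest element of nums"
    # The code actually sums only the even elements.
    total = 0
    for n in nums:
        if n % 2 == 0:
            total += n
    return total

# Question: what does f([3, 4, 7, 10]) return?
# A model that shortcuts via the comment may answer 10 (the maximum);
# executing the code gives 4 + 10 = 14.
assert f([3, 4, 7, 10]) == 14
```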


Training Code Language Models with Comprehensive Semantics Reasoning

Neural Information Processing Systems

Code Large Language Models (Code LLMs) have excelled at tasks like code completion but often miss deeper semantics such as execution effects and dynamic states. This paper aims to bridge the gap between Code LLMs' reliance on static text data


CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning

Roy, Monoshi Kumar, Chen, Simin, Steenhoek, Benjamin, Peng, Jinjun, Kaiser, Gail, Ray, Baishakhi, Le, Wei

arXiv.org Artificial Intelligence

Understanding and reasoning about code semantics is essential for enhancing code LLMs' abilities to solve real-world software engineering (SE) tasks. Although several code reasoning benchmarks exist, most rely on synthetic datasets or educational coding problems and focus on coarse-grained reasoning tasks such as input/output prediction, limiting their effectiveness in evaluating LLMs in practical SE contexts. To bridge this gap, we propose CodeSense, the first benchmark that makes available a spectrum of fine-grained code reasoning tasks concerned with the software engineering of real-world code. We collected Python, C, and Java software projects from real-world repositories. We executed tests from these repositories, collected their execution traces, and constructed a ground truth dataset for fine-grained semantic reasoning tasks. We then performed comprehensive evaluations on state-of-the-art LLMs. Our results show a clear performance gap for the models to handle fine-grained reasoning tasks. Although prompting techniques such as chain-of-thought and in-context learning helped, the lack of code semantics in LLMs fundamentally limits their code reasoning capabilities. Besides the dataset, benchmark, and evaluation, our work produced an execution tracing framework and tool set that make it easy to collect ground truth for fine-grained SE reasoning tasks, offering a strong basis for future benchmark construction and model post-training. Our code and data are located at https://codesense-bench.github.io/.
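
The paper's tracing framework is not shown here, but a minimal sketch of the general idea of collecting line-level execution states as ground truth, using Python's standard sys.settrace hook (the function names below are placeholders), could look like the following:

```python
import sys

def trace_execution(func, *args):
    """Record (line number, local variables) pairs while func runs.
    A minimal stand-in for a tracing framework; real tooling would also
    handle nested frames, C extensions, and value serialization."""
    events = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            events.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, events

def clip(x, lo, hi):
    if x < lo:
        x = lo
    if x > hi:
        x = hi
    return x

result, trace = trace_execution(clip, 12, 0, 10)
print(result)                      # 10
for lineno, local_vars in trace:
    print(lineno, local_vars)      # per-line variable states usable as ground truth
```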


Embedding Inference for Structured Multilabel Prediction

Farzaneh Mirzazadeh, Siamak Ravanbakhsh, Nan Ding, Dale Schuurmans

Neural Information Processing Systems

A key bottleneck in structured output prediction is the need for inference during training and testing, usually requiring some form of dynamic programming. Rather than using approximate inference or tailoring a specialized inference method for a particular structure (the standard responses to the scaling challenge), we propose to embed prediction constraints directly into the learned representation. By eliminating the need for explicit inference, a more scalable approach to structured output prediction can be achieved, particularly at test time. We demonstrate the idea for multi-label prediction under subsumption and mutual exclusion constraints, where a relationship to maximum margin structured output prediction can be established. Experiments demonstrate that the benefits of structured output training can still be realized even after inference has been eliminated.
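
The abstract stays at a high level; as a rough, hypothetical sketch of the general idea (not the paper's actual formulation), one could picture each label as a ball in the embedding space so that subsumption becomes geometric containment and prediction reduces to simple membership tests rather than dynamic programming:

```python
import numpy as np

# Hypothetical, hand-crafted geometry purely for illustration: each label is a
# ball (center, radius). "dog" lies inside "animal", so any input embedding
# that falls in the dog ball also falls in the animal ball, enforcing the
# subsumption constraint without explicit inference at test time.
labels = {
    "animal": (np.array([0.0, 0.0]), 2.0),
    "dog":    (np.array([0.5, 0.0]), 0.8),
    "cat":    (np.array([-0.5, 0.0]), 0.8),
}

def predict(x):
    """Multi-label prediction by membership tests only (no dynamic programming)."""
    return {name for name, (c, r) in labels.items() if np.linalg.norm(x - c) <= r}

x = np.array([0.6, 0.1])   # an input embedding produced by some learned encoder
print(predict(x))          # {'animal', 'dog'}: predicting "dog" entails "animal"
```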


Learning Structured Output Representation using Deep Conditional Generative Models

Kihyuk Sohn, Honglak Lee, Xinchen Yan

Neural Information Processing Systems

Supervised deep learning has been successfully applied to many recognition problems. Although it can approximate a complex many-to-one function well when a large amount of training data is provided, it is still challenging to model complex structured output representations that effectively perform probabilistic inference and make diverse predictions. In this work, we develop a deep conditional generative model for structured output prediction using Gaussian latent variables. The model is trained efficiently in the framework of stochastic gradient variational Bayes, and allows for fast prediction using stochastic feed-forward inference. In addition, we provide novel strategies to build robust structured prediction algorithms, such as input noise injection and a multi-scale prediction objective at training time. In experiments, we demonstrate the effectiveness of our proposed algorithm in comparison to deterministic deep neural network counterparts in generating diverse but realistic structured output predictions using stochastic inference. Furthermore, the proposed training methods are complementary, which leads to strong pixel-level object segmentation and semantic labeling performance on Caltech-UCSD Birds 200 and the subset of Labeled Faces in the Wild dataset.
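
For readers unfamiliar with the setup, a minimal PyTorch sketch of a conditional VAE trained with the reparameterization trick (stochastic gradient variational Bayes) might look like the following; the dimensions, architecture, and the standard-normal prior are placeholder simplifications, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    """Minimal conditional VAE for structured output prediction:
    encoder models q(z | x, y), decoder models p(y | x, z), and training
    minimizes the negative ELBO via the reparameterization trick."""
    def __init__(self, x_dim=32, y_dim=16, z_dim=8, hidden=64):
        super().__init__()
        self.z_dim = z_dim
        self.enc = nn.Sequential(nn.Linear(x_dim + y_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))   # q(z | x, y)
        self.dec = nn.Sequential(nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, y_dim))       # p(y | x, z)

    def loss(self, x, y):
        mu, logvar = self.enc(torch.cat([x, y], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        y_logits = self.dec(torch.cat([x, z], dim=-1))
        recon = F.binary_cross_entropy_with_logits(y_logits, y, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl                                        # negative ELBO

    @torch.no_grad()
    def sample(self, x):
        # Standard-normal prior over z (a common simplification); drawing several
        # z samples per input yields diverse structured predictions.
        z = torch.randn(x.size(0), self.z_dim)
        return torch.sigmoid(self.dec(torch.cat([x, z], dim=-1)))

model = CVAE()
x = torch.rand(4, 32)
y = torch.randint(0, 2, (4, 16)).float()
model.loss(x, y).backward()        # one SGVB gradient step would follow
print(model.sample(x).shape)       # torch.Size([4, 16])
```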


Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?

Orvalho, Pedro, Kwiatkowska, Marta

arXiv.org Artificial Intelligence

Understanding the reasoning and robustness of Large Language Models (LLMs) is critical for their reliable use in programming tasks. While recent studies have assessed LLMs' ability to predict program outputs, most focus solely on the accuracy of those predictions, without evaluating the reasoning behind them. Moreover, it has been observed on mathematical reasoning tasks that LLMs can arrive at correct answers through flawed logic, raising concerns about similar issues in code understanding. In this work, we evaluate whether state-of-the-art LLMs with up to 8B parameters can reason about Python programs or are simply guessing. We apply five semantics-preserving code mutations: renaming variables, mirroring comparison expressions, swapping if-else branches, converting for loops to while loops, and loop unrolling. These mutations preserve a program's semantics while altering its syntax. We evaluated six LLMs and performed a human expert analysis using LiveCodeBench to assess whether the correct predictions are based on sound reasoning. We also evaluated prediction stability across different code mutations on LiveCodeBench and CruxEval. Our findings show that LLMs trained for code produce correct predictions based on flawed reasoning in 10% to 50% of cases. Furthermore, LLMs often change predictions in response to our code mutations, indicating they do not yet exhibit stable, semantically grounded reasoning.
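
Mutations of the kind listed above are easy to picture; the toy example below (not from the paper's benchmark data) shows two of them, variable renaming and if-else branch swapping with a negated condition, applied to the same small Python function:

```python
# Original function used as the output-prediction target.
def sign_label(x):
    if x >= 0:
        return "non-negative"
    else:
        return "negative"

# Mutation 1: variable renaming (semantics unchanged).
def sign_label_renamed(value):
    if value >= 0:
        return "non-negative"
    else:
        return "negative"

# Mutation 2: swapped if-else branches with a negated condition (semantics unchanged).
def sign_label_swapped(x):
    if not (x >= 0):
        return "negative"
    else:
        return "non-negative"

# All three variants agree on every input, so a model with stable,
# semantically grounded reasoning should predict the same output for each.
for x in (-3, 0, 7):
    assert sign_label(x) == sign_label_renamed(x) == sign_label_swapped(x)
```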


Learning Library Cell Representations in Vector Space

Liang, Rongjian, Lu, Yi-Chen, Liu, Wen-Hao, Ren, Haoxing

arXiv.org Artificial Intelligence

We propose Lib2Vec, a novel self-supervised framework to efficiently learn meaningful vector representations of library cells, enabling ML models to capture essential cell semantics. The framework comprises three key components: (1) an automated method for generating regularity tests to quantitatively evaluate how well cell representations reflect inter-cell relationships; (2) a self-supervised learning scheme that systematically extracts training data from Liberty files, removing the need for costly labeling; and (3) an attention-based model architecture that accommodates various pin counts and enables the creation of property-specific cell and arc embeddings. Experimental results demonstrate that Lib2Vec effectively captures functional and electrical similarities. Moreover, linear algebraic operations on cell vectors reveal meaningful relationships, such as vector(BUF) - vector(INV) + vector(NAND) approximating vector(AND), showcasing the framework's nuanced representation capabilities. Lib2Vec also enhances downstream circuit learning applications, especially when labeled data is scarce. Library cell representations are vital for effective machine learning (ML)-based circuit analysis and optimization, as library cells are the fundamental building blocks of circuit netlists. Traditional methods often rely on manually defined features [1]-[4], requiring extensive expertise and feature engineering. Alternatively, one-hot encoding [5] demands large amounts of domain-specific training data, which may not always be available.
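
The vector-arithmetic claim can be checked with a tiny nearest-neighbor test; the embeddings below are made-up placeholders chosen to make the analogy hold, not Lib2Vec outputs:

```python
import numpy as np

# Made-up cell embeddings purely to illustrate the analogy test described in
# the abstract; real Lib2Vec embeddings are learned from Liberty files.
emb = {
    "INV":  np.array([1.0, 0.0, 0.0]),   # inverting, single-input
    "BUF":  np.array([0.0, 0.0, 0.0]),   # non-inverting, single-input
    "NAND": np.array([1.0, 1.0, 0.0]),   # inverting, two-input AND structure
    "AND":  np.array([0.0, 1.0, 0.0]),   # non-inverting, two-input AND structure
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# vector(BUF) - vector(INV) + vector(NAND) should land nearest to vector(AND).
query = emb["BUF"] - emb["INV"] + emb["NAND"]
nearest = max(emb, key=lambda name: cosine(query, emb[name]))
print(nearest)   # AND
```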


What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces

Armengol-Estapé, Jordi, Carbonneaux, Quentin, Zhang, Tianjun, Markosyan, Aram H., Seeker, Volker, Cummins, Chris, Kambadur, Melanie, O'Boyle, Michael F. P., Wang, Sida, Synnaeve, Gabriel, Leather, Hugh James

arXiv.org Artificial Intelligence

Code generation and understanding are critical capabilities for large language models (LLMs). Thus, most LLMs are pretrained and fine-tuned on code data. However, these datasets typically treat code as static strings and rarely exploit the dynamic information about their execution. Building upon previous work on trace modeling, we study Execution Tuning (E.T.), a training procedure in which we explicitly model real-world program execution traces without requiring manual test annotations. We train and evaluate models on different execution trace granularities (line and instruction level) and strategies on the task of output prediction, obtaining 80% accuracy on CruxEval and MBPP, and showing the advantages of dynamic scratchpads (i.e., self-contained intermediate computations updated by the model rather than accumulated as a history of past computations) on long executions (up to 14k steps). Finally, we discuss E.T.'s practical applications. Current state-of-the-art general-purpose LLMs are thought to contain considerable proportions of code in their pretraining data (OpenAI et al., 2024), which is known to improve reasoning capabilities even in tasks seemingly unrelated to code (Aryabumi et al., 2024). However, datasets used to train code LLMs (such as Lozhkov et al. (2024)) typically treat code as static strings and rarely exploit the dynamic information about their execution. Executability is one of the key differences between code and natural language, and most code datasets neglect dimensions of the code domain such as reasoning over code execution, which in turn could lead to better code understanding. This fundamental limitation has sparked a renewed interest in modeling program executions, connecting with the pre-LLM neural program evaluation literature (Zaremba & Sutskever, 2014; Graves et al., 2014), which studied whether neural networks could learn to execute programs.
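
The "dynamic scratchpad" idea can be illustrated with a toy example; the representation below is a hypothetical sketch of the concept, not the paper's actual trace encoding:

```python
# Hypothetical illustration of the difference between an accumulated trace
# history and a dynamic scratchpad when modeling a long-running loop.

# Accumulated history: one entry per executed step, so the context grows
# linearly with execution length (impractical for runs of ~14k steps).
history = []
total = 0
for i in range(5):
    total += i
    history.append(f"step {i}: total={total}")

# Dynamic scratchpad: a fixed-size, self-contained state that is rewritten
# at every step instead of appended to, so its size stays constant no matter
# how long the execution runs.
scratchpad = {"i": None, "total": 0}
for i in range(5):
    scratchpad["total"] += i
    scratchpad["i"] = i

print(len(history), history[-1])   # 5 step 4: total=10
print(scratchpad)                  # {'i': 4, 'total': 10}
```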